Characterizing Weblog Corpora

Authors

  • Fernando Perez-Tellez
  • David Pinto
  • John Cardiff
  • Paolo Rosso
Abstract

To exploit the huge volume of information published in the blogosphere, it is essential to provide techniques, such as clustering, that can automatically analyze and classify its contents. However, these techniques typically produce better results on wide-domain, full-text documents. In most cases, blogs can be considered “short texts”: they are not extensive documents, and they exhibit characteristics that are undesirable from a clustering perspective, such as low-frequency terms, small vocabulary size, and vocabulary overlap across some domains. Furthermore, their characteristics vary widely depending on the specific interests of the writer, the writer's linguistic style, and the volume of text produced. In this work, we present a set of evaluation features with which we can establish the relative hardness of the clustering task, i.e., how easy or difficult it will be to cluster the blog datasets accurately. These features are shortness, domain broadness, class imbalance, stylometry, and structure. We report results obtained on corpora extracted from two popular blogging sites, Boing Boing (“B-B”) and Slashdot. The results are contrasted with characterizations of a number of other corpora consisting of newspaper articles and academic papers. These results can inform the choice of the most appropriate clustering methodology.
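As a rough illustration of what two of the listed features might measure, the sketch below computes simple proxies for shortness and class imbalance. The abstract does not give the paper's formulas, so everything here is an assumption: the function names `shortness` and `class_imbalance`, the whitespace tokenizer, and the choice of deviation from a uniform class distribution as the imbalance score are all hypothetical stand-ins, not the authors' definitions.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; a real study
    # would likely use a proper tokenizer and stop-word handling).
    return text.lower().split()


def shortness(docs):
    # Illustrative shortness proxies: mean document length in tokens
    # and overall vocabulary size of the corpus.
    token_lists = [tokenize(d) for d in docs]
    vocabulary = {tok for toks in token_lists for tok in toks}
    avg_len = sum(len(toks) for toks in token_lists) / len(token_lists)
    return {"avg_doc_length": avg_len, "vocabulary_size": len(vocabulary)}


def class_imbalance(labels):
    # Illustrative imbalance score: root-mean-square deviation of the
    # observed class proportions from a perfectly uniform distribution.
    # 0.0 means every class has the same number of documents.
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    expected = 1.0 / k
    squared = sum((c / n - expected) ** 2 for c in counts.values())
    return math.sqrt(squared / k)


if __name__ == "__main__":
    posts = ["new gadget review", "gadget news", "open source kernel patch"]
    categories = ["tech", "tech", "linux"]
    print(shortness(posts))
    print(class_imbalance(categories))
```

Under these assumptions, a corpus with a low average document length and a small vocabulary, or a high imbalance score, would be expected to be harder to cluster; the remaining features (domain broadness, stylometry, structure) would need definitions that the abstract does not provide.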

Related resources

Building Emotion Lexicon from Weblog Corpora

An emotion lexicon is an indispensable resource for emotion analysis. This paper aims to mine the relationships between words and emotions using weblog corpora. A collocation model is proposed to learn emotion lexicons from weblog articles. Emotion classification at sentence level is experimented by using the mined lexicons to demonstrate their usefulness.

Identifying Personal Narratives in Chinese Weblog Posts

Automated text classification technologies have enabled researchers to amass enormous collections of personal narratives posted to English-language weblogs. In this paper, we explore analogous approaches to identify personal narratives in Chinese weblog posts as a precursor to the future empirical studies of cross-cultural differences in narrative structure. We describe the collection of over h...

Minimal Narrative Annotation Schemes and Their Applications

The increased use of large corpora in narrative research has created new opportunities for empirical research and intelligent narrative technologies. To best exploit the value of these corpora, several research groups are eschewing complex discourse analysis techniques in favor of high-level minimalist narrative annotation schemes that can be quickly applied, achieve high inter-rater agreement,...

Leave a Reply: An Analysis of Weblog Comments

Access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. This overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. In this paper we present a large-scale study of weblog comments and their relation to the posts. Using a...

Publication date: 2009